Text Splitter & Chunker for RAG / LLMs

Pricing

from $5.00 / 1,000 text chunks

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

Rating

0.0 (0)

Developer

Rosario Vitale

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 4 days ago

Split any text into clean, overlapping chunks that are ready for embeddings, vector databases, RAG pipelines and LLM context windows — without writing your own splitter.

Paste text (or send many documents), pick a chunk size and overlap, and get back tidy chunks with character counts and approximate token counts as JSON or CSV.

Why

Every RAG / LLM pipeline needs chunking, and everyone re-implements the same fiddly logic: respect paragraph and sentence boundaries, keep an overlap so context isn't lost, normalize messy whitespace, and estimate tokens. This Actor does it for you, reliably, in one call.

Features

  • ✂️ Smart chunking — packs text up to your target size while respecting paragraph/sentence boundaries.
  • 🔁 Overlap — keeps a configurable overlap so ideas spanning a boundary aren't lost.
  • 🔢 Characters or tokens — size and overlap in characters or approximate tokens (~4 chars/token).
  • 🧹 Cleaning — normalizes whitespace and collapses excessive blank lines.
  • 📦 Batch — split many documents in a single run.
  • 📊 Token estimate — every chunk includes charCount and approxTokens.
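The cleaning, packing and overlap features above can be illustrated with a minimal sketch. This is not the Actor's actual implementation: the function names `clean_text` and `chunk_by_paragraph` are hypothetical, sizes are in characters only, and the overlap is taken as the raw tail of the previous chunk (so it may start mid-word).

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of spaces/tabs and reduce 3+ newlines to one blank line."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces hugging a newline
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse excessive blank lines
    return text.strip()


def chunk_by_paragraph(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Pack whole paragraphs into chunks of roughly chunk_size characters.

    Each new chunk starts with the last `overlap` characters of the previous
    chunk. Assumption: a paragraph longer than chunk_size is kept whole; a
    fuller version would fall back to sentence or character splitting.
    """
    paragraphs = [p for p in clean_text(text).split("\n\n") if p]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would overflow: close the current chunk
            # and seed the next one with the overlapping tail.
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because the overlap tail is prepended after the size check, a chunk in this sketch can exceed `chunk_size` by up to `overlap` characters; the size is a target, not a hard limit.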

Input

Field          Type      Description
text           string    A single document to split.
texts          array     Multiple documents (one per item).
chunkSize      integer   Target chunk size. Default 1000.
chunkOverlap   integer   Overlap between chunks. Default 100.
unit           select    characters or tokens. Default characters.
splitBy        select    paragraph, sentence or character. Default paragraph.
clean          boolean   Normalize whitespace. Default true.

Example input

{
  "text": "Your long document text goes here...",
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "unit": "characters",
  "splitBy": "paragraph",
  "clean": true
}
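If you assemble the input programmatically, a small helper can catch typos in the option values before a run starts. The sketch below is an illustration only: `build_input` is a hypothetical name, the field names and allowed values come from the Input table, and the rule that the overlap must be smaller than the chunk size is an assumption, not documented behavior.

```python
from typing import Iterable, Union

ALLOWED_UNITS = {"characters", "tokens"}
ALLOWED_SPLIT_BY = {"paragraph", "sentence", "character"}


def build_input(
    documents: Union[str, Iterable[str]],
    chunk_size: int = 1000,
    chunk_overlap: int = 100,
    unit: str = "characters",
    split_by: str = "paragraph",
    clean: bool = True,
) -> dict:
    """Build an input object using the field names from the Input table."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {sorted(ALLOWED_UNITS)}")
    if split_by not in ALLOWED_SPLIT_BY:
        raise ValueError(f"splitBy must be one of {sorted(ALLOWED_SPLIT_BY)}")
    # Assumption: an overlap at least as large as the chunk is never useful.
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize")

    payload = {
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
        "unit": unit,
        "splitBy": split_by,
        "clean": clean,
    }
    # A single string goes in "text"; several documents go in "texts".
    if isinstance(documents, str):
        payload["text"] = documents
    else:
        payload["texts"] = list(documents)
    return payload
```

Passing the result through `json.dumps` gives an object in the same shape as the example input.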

Output

One dataset item per chunk:

{
  "sourceIndex": 0,
  "chunkIndex": 0,
  "totalChunks": 3,
  "text": "Retrieval-Augmented Generation (RAG) combines a language model ...",
  "charCount": 312,
  "approxTokens": 78
}
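Since a batch run returns one flat list of items, you will usually want to regroup the chunks by document before embedding them. The following sketch assumes that `sourceIndex` identifies the input document and that `chunkIndex` orders the chunks within it; `group_chunks` is a hypothetical helper name.

```python
from collections import defaultdict


def group_chunks(items: list[dict]) -> dict[int, list[str]]:
    """Group dataset items by sourceIndex and order them by chunkIndex."""
    by_source: dict[int, list[dict]] = defaultdict(list)
    for item in items:
        by_source[item["sourceIndex"]].append(item)

    grouped: dict[int, list[str]] = {}
    for source_index in sorted(by_source):
        ordered = sorted(by_source[source_index], key=lambda c: c["chunkIndex"])
        grouped[source_index] = [c["text"] for c in ordered]
    return grouped
```

Note that joining the grouped chunks does not reproduce the original document when the overlap is greater than zero, because the overlapping text appears twice.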

Export as JSON, CSV, or Excel, or pull via the Apify API — then send the chunks straight to your embeddings model or vector DB.

Common use cases

  • Prepare documents for embeddings + vector search (Pinecone, Qdrant, Weaviate, pgvector).
  • Build RAG context for ChatGPT/Claude apps.
  • Fit long content into LLM context windows.
  • Pair with PDF to Structured Data: extract text from PDFs, then chunk it here.

Notes

  • Token counts are an estimate (~4 characters per token); exact tokenization depends on the model.
  • In character split mode, the text is hard-cut at the size boundary; the paragraph and sentence modes respect natural boundaries.
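The two notes above can be sketched in a few lines. This is an illustration under stated assumptions, not the Actor's code: the names `approx_tokens`, `to_chars` and `hard_cut` are hypothetical, rounding up in the token estimate is a guess, and the hard cut simply slides a fixed-size window forward by `chunk_size - overlap` characters.

```python
import math

CHARS_PER_TOKEN = 4  # the ~4 characters per token heuristic from the notes


def approx_tokens(text: str) -> int:
    """Estimate the token count from the character count (rounding up is an assumption)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def to_chars(size: int, unit: str) -> int:
    """Convert a size given in 'tokens' to characters; pass 'characters' through."""
    return size * CHARS_PER_TOKEN if unit == "tokens" else size


def hard_cut(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into fixed-size windows, ignoring word and sentence boundaries."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With this heuristic, a 312-character chunk is estimated at 78 tokens, which matches the example output; a real tokenizer for your model may give a different count.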